The Contaminant sub-goal of the Clean Water goal captures the degree to which local waters are unpolluted by contaminants. This sub-goal scores highest when the contamination level is below a threshold, which is defined by the Marine Framework Directive. For the BHI three contaminants indicators are proposed, describing different aspects of toxicity: dioxin and dioxin like compounds, polychlorinated biphenyl compounds (PCBs), and perfluorooctanesulfonic acid (PFOS). In addition, a penalty factor has been applied to account for the fact that current monitoring programs do not cover all the registered and harmful contaminants.
All contaminant data were downloaded from the open-accessible ICES database (see below).
Table to calculate the Concerning Substances indicator (penalty factor):
Non-dioxin like PCBs: sum of congeners (28, 52, 101, 138, 153, 180). Target is set at the threshold of 75 ug/kg ww (wet weight) fish muscle.
This is similar to the ICES-7 except that PCB 118 is excluded, since it is metabolized by mammals.
75 ng/g wet weight is the EU threshold for fish muscle. See Section 5 Annex, 5.3. This threshold was also agreed upon as GES boundary at the meeting of the Working Group on the State of the Environment and Nature Conservation April 11-15, 2016. Recevied the draft report from Elisabeth Nyberg.
Dioxin and dioxin-like compounds. Target is set at 0.0065 TEQ ug/kg ww fish, crustaceans or molluscs (source of target: EQS biota human health). Secondary GES boundary: CB-118 24 ug/kg lw fish liver or muscle (source: EAC).
This threshold was agreed upon as GES indicator at the meeting of the Working Group on the State of the Environment and Nature Conservation April 11-15, 2016. Recevied the draft report from Elisabeth Nyberg.
This is consistent with the EU human health thresholds for dioxin and dioxin-like compounds - 6.5 pg/g
TEQ values from the World Health Organization 2005
According to HELCOM PFOS core indicator document, p.3, the “GES boundary is set to 9.1 μg/kg wet weight (or 9.1 ng/g ww) with the protection goal of human health”.
“The GES boundary is an environmental quality standard (EQS), derived at EU level as a substance included on the list of priority substances under the Water Framework Directive (European Commission 2000, 2013). GES, in accordance with the MSFD is defined as ‘concentrations of contaminants at levels not giving rise to pollution effects’. EQS are derived from ecotoxicological studies to protect freshwater and marine ecosystems from potential adverse effects of chemicals, as well as adverse effects on human health via drinking water and food from aquatic environments. Quality Standards (QS) are derived for different protection goals, i.e.: pelagic and benthic communities, top-predators in these ecosystems, and human health. The most stringent of these QS is the basis for the EQS. The EQS boundary for PFOS is based on the QS set for biota to protect human health (9.1 μg/ kg fish ww), defined for edible parts in fish. For harmonization purposes the EC Guidance Document No. 32 on biota monitoring (the implementation of EQS biota) under the WFD was developed (European Commission 2014). This guidance document recommends that the results from the monitoring should be standardized to represent fish at a trophic level of 4, which is an estimate of the general trophic level in commercial fish in Europe. The recommendation to obtain PFOS data in fish at a trophic level of 4 is to adjust the values from monitoring in accordance with trophic magnification factors and trophic level.” \(-\) HELCOM PFOS core indicator document, p.8
HELCOM core indicator report uses liver PFOS concentrations converted to muscle equivalent values as in Faxneld et al. 2014b.
Only a few environmental quality standards (EQS) are to date defined in the Baltic Sea for marine surface sediments. The Norwegian threshold values are often used when a comparison with ecotoxicological effects is needed. In particular, the Norwegian Environment Agency has defined EQS in sediment for 28 EU priority substances.
PCB7 (sum PCB 28, 52, 101, 118, 138, 153, and 180): Target is set at 4.1 TS ug/kg dw. Environmental quality classification of water bodies
Recommended threshold values are given only for total PCB7, and not for each individual congener. This is because toxicity data are available for only a minority of congeners.
Dioxin and dioxin-like compounds. Target is set at 0.00086 TEQ ug/kg dw (Total TEQ). Norwegian Environment Agency TEQ values from the World Health Organization 2005
External advisors/goalkeepers: Anna Sobek
This prep document is used to generate and explore the following data layers:
cw_con_pcb_bhi2019.csvcw_con_pfos_bhi2019.csvcw_con_dioxin_bhi2019.csvcw_con_penalty_bhi2019.csvThese are saved to the layers/v2019 folder. Intermediate datasets saved to data/CW/contaminants/v2019/intermediate include: pcb_bio_cleaned.csv, pcb_sed_cleaned.csv, pfos_bio_cleaned.csv, dioxin_bio_cleaned.csv and dioxin_sed_cleaned.csv. All these are derived from or informed by the following raw datasets.
PCB Data in Biota
| Option | Specification |
|---|---|
| Year | 1990-2017 |
| Purpose of monitoring | All |
| Country | All |
| Monitoring Program | All |
| Parameter Group | Chlorobiphenyls |
| Reporting Laboratory | All |
| Analytical laboratory | All |
| Geographical Areas | (HELCOM) ALL HELCOM sub-basins |
| Data Code | |
|---|---|
| CB180 | 2,2’,3,4,4’,5,5’-heptachlorobiphenyl |
| SCB7 | sum of CBs.- Sum of |
| CB28 | 2,4,4’-trichlorobiphenyl |
| CB52 | 2,2’,5,5’-tetrachlorobiphenyl |
| CB101 | 2,2’,4,5,5’-pentachlorobiphenyl |
| CB118 | 2,3’,4,4’,5-pentachlorobiphenyl |
| CB153 | 2,2’,4,4’,5,5’-hexachlorobiphenyl |
| CB138 | 2,2’,3,4,4’,5’-hexachlorobiphenyl |
| PCB | polychlorinated biphenyls - Deprecated- Report as single PCBs |
| CB194 | 2,2’,3,3’,4,4’,5,5’-octachlorobiphenyl |
| CB105 | 2,3,3’,4,4’-pentachlorobiphenyl |
| CB110 | 2,3,3’,4’,6-pentachlorobiphenyl |
| CB126 | 3,3’,4,4’,5-pentachlorobiphenyl |
| CB128 | 2,2’,3,3’,4,4’-hexachlorobiphenyl |
| CB149 | 2,2’,3,4’,5’,6-hexachlorobiphenyl |
| CB151 | 2,2’,3,5,5’,6-hexachlorobiphenyl |
| CB156 | 2,3,3’,4,4’,5-hexachlorobiphenyl |
| CB157 | 2,3,4,3’,4’,5’-hexachlorobiphenyl |
| CB170 | 2,2’,3,3’,4,4’,5-heptachlorobiphenyl |
| CB44 | 2,2’,3,5’-tetrachlorobiphenyl |
| CB49 | 2,2’,4,5’-tetrachlorobiphenyl |
| CB99 | 2,2’,4,4’,5-pentachlorobiphenyl |
| CB77 | 3,3’,4,4’-tetrachlorobiphenyl |
| CB169 | 3,3’,4,4’,5,5’-hexachlorobiphenyl |
| CB31 | 2,4’,5-trichlorobiphenyl |
| CB81 | 3,4,4’,5-tetrachlorobiphenyl |
| SCB | sum of CBs.- Specify in method data |
| CB189 | 2,3,3’,4,4’,5,5’-heptachlorobiphenyl |
| CB114 | 2,3,4,4’,5-pentachlorobiphenyl |
| CB123 | 1,1’-Biphenyl, 2,3’,4,4’,5’-pentachloro- |
| CB167 | 2’,3,4,4’,5,5’-hexachlorobiphenyl |
| CB187 | 2,2’,3,4’,5,5’,6-heptachlorobiphenyl |
| CB66 | 2,3’,4,4’-tetrachlorobiphenyl |
| CB60 | 1,1’-Biphenyl, 2,3,4,4’-tetrachloro- |
| CB141 | 2,2’,3,4,5,5’-hexachlorobiphenyl |
| CB74 | 2,4,4’,5-tetrachlorobiphenyl |
| CB206 | 2,2’,3,3’,4,4’,5,5’,6-nonachlorobiphenyl |
| CB33 | 2’,3,4-trichlorobiphenyl |
| CB18 | 2,2’,5-trichlorobiphenyl |
| CB183 | 2,2’,3,4,4’,5’,6-heptachlorobiphenyl |
| CB47 | 2,2’,4,4’-tetrachlorobiphenyl |
| CB209 | 2,2’,3,3’,4,4’,5,5’,6,6’-decachlorobiphenyl |
| CB51 | 2,2’,4,6’-Tetrachlorobiphenyl |
| CB122 | 1,1’-Biphenyl, 2,3,3’,4’,5’-pentachloro- |
| CB138+163 | 2,2’,3,4,4’,5’-hexachlorobiphenyl |
PCBs in Sediment
| Option | Specification |
|---|---|
| Year | All8 |
| Purpose of monitoring | All |
| Country | All |
| Monitoring Program | All |
| Parameter Group | Chlorobiphenyls |
| Reporting Laboratory | All |
| Analytical laboratory | All |
| Geographical Areas | (ICES) All ICES Areas |
PFOS in Biota
| Option | Specification |
|---|---|
| Year | 2005-2017 (2005 earliest allowed) |
| Purpose of monitoring | All |
| Country | All |
| Monitoring Program | All |
| Parameter Group | Organofluorines |
| Reporting Laboratory | All |
| Analytical laboratory | All |
| Geographical Areas | (HELCOM) ALL HELCOM sub-basins |
Dioxins in Biota
| Option | Specification |
|---|---|
| Year | 1998-2017 (1998 earliest allowed) |
| Purpose of monitoring | All |
| Country | All |
| Monitoring Program | All |
| Parameter Group | Dioxins |
| Reporting Laboratory | All |
| Analytical laboratory | All |
| Geographical Areas | (HELCOM) ALL HELCOM sub-basins |
| Data Code | |
|---|---|
| CDD1N | 1 2 3 7 8-pentachlorodibenzo-p-dioxin |
| CDD4X | 1 2 3 4 7 8-hexachlorodibenzo-p-dioxin |
| CDD6P | 1 2 3 4 6 7 8-heptachlorodibenzo-p-dioxin |
| CDD6X | 1 2 3 6 7 8-hexachlorodibenzo-p-dioxin |
| CDD9X | 1 2 3 7 8 9-hexachlorodibenzo-p-dioxin |
| CDDO | 1 2 3 4 6 7 8 9-octachlorodibenzo-p-dioxin |
| CDF2N | 2 3 4 7 8-pentachlorodibenzofuran |
| CDF2T | 2 3 7 8-tetrachloro-dibenzofuran |
| CDF4X | 2 3 4 6 7 8-hexachlorodibenzofuran |
| CDF6P | 1 2 3 4 6 7 8-heptachlorodibenzofuran |
| CDF6X | 1 2 3 6 7 8-hexachlorodibenzofuran |
| CDF9P | 1 2 3 4 7 8 9-heptachlorodibenzofuran |
| CDF9X | 1 2 3 7 8 9-hexachlorodibenzofuran |
| CDFDN | 1 2 3 7 8/1 2 3 4 8-pentachloro-dibenzofuran |
| CDFO | octachloro-dibenzofuran (group) |
| TCDD | 2 3 7 8-tetrachlorodibenzo-p-dioxin |
Dioxins in Sediment
| Option | Specification |
|---|---|
| Year | All |
| Purpose of monitoring | All |
| Country | All |
| Monitoring Program | All |
| Parameter Group | Dioxins |
| Reporting Laboratory | All |
| Analytical laboratory | All |
| Geographical Areas | (ICES) All ICES Areas |
2.1.4 Concerning Substances Datasets
A list of concerning substances was obtained from European Chemical Agency Candidate List of substances of very high concern for Authorisation (published in accordance with Article 59(10) of the REACH Regulation). These substances were compared with those in the ICES database. Datasets for substances with ICES records were downloaded and saved in a similar manner as the PCBs, PFOS, and Organofluorines datasets.
This preliminary data wrangling includes steps to harmonize and check the 5 datasets used in the Contaminants subgoal:
More details about the standardization and convertions in section 2.2.2.
## root location of the raw data
dir_rawdata <- file.path(dir_B, "Goals", "CW", "CON")
lapply(
list(
c("ContaminantsBiota_20Feb07_PCBs", "pcb_rawdata_bio"),
c("ContaminantsSediment_PCBs", "pcb_rawdata_sed"),
c("ContaminantsBiota_20Feb07_PFOS", "pfos_rawdata_bio"),
c("ContaminantsBiota_20Feb07_Dioxins", "dioxin_rawdata_bio"),
c("ContaminantsSediment_Dioxins", "dioxin_rawdata_sed")
),
function(x){
read_csv(
file.path(dir_rawdata, x[1], paste(x[1], "csv", sep = ".")),
## where not read in correctly, column types must be specified ...
col_types = cols(
.default = col_character(),
MYEAR = col_number(),
Latitude = col_number(),
Longitude = col_number(),
NOINP = col_number(),
Value = col_number(),
DETLI = col_number(),
LMQNT = col_number(),
UNCRT = col_number(),
tblAnalysisID = col_number(),
tblParamID = col_number(),
tblBioID = col_number(),
tblSampleID = col_number()
)
) %>% assign(x = x[2], envir = .GlobalEnv)
}
)
Renaming columns and some variables within columns for clarity. See references: Contaminants in Biota and Contaminants in Sediment. For more information about the ICES reporting format for environmental data see this document, also referenced in this document from HELCOM BalticBOOST workshop on the HOLAS II hazardous substance assessment.
rename_vars <- function(dataset){
df_vars_renamed <- dataset %>%
## create date, month, year, day columns
dplyr::mutate(date = as.Date(DATE, "%d/%m/%Y")) %>%
dplyr::mutate(
day = lubridate::day(date),
month = lubridate::month(date),
year = lubridate::year(date)) %>%
## change column names
dplyr::rename(
country = Country, report_institute = RLABO,
station = STATN,
monit_year = MYEAR, date_ices = DATE,
latitude = Latitude, longitude = Longitude,
param_group = PARGROUP, variable = PARAM,
value = Value, unit = MUNIT,
species = Species, sex_specimen = SEXCO, num_indiv_subsample = NOINP,
test_organism = `Test Organism`,
matrix_analyzed = MATRX, not_used_in_datatype = NODIS,
monit_program = MPROG, monit_purpose = PURPM,
basis_determination = BASIS, qflag = QFLAG,
vflag = VFLAG, detect_lim = DETLI,
quant_lim = LMQNT, uncert_val = UNCRT, method_uncert = METCU,
analyt_lab = ALABO, ref_source = REFSK,
method_storage = METST, method_pretreat = METPT,
method_pur_sep = METPS, method_chem_fix = METFP,
method_chem_extract = METCX, method_analysis = METOA,
formula_calc = FORML,
sub_samp_id = SUBNO, bulk_id = BULKID,
sampler_type = SMTYP, factor_compli_interp = FINFL,
analyt_method_ref = tblAnalysisID,
measurement_ref = tblParamID,
samp_ref = tblSampleID
) %>%
## improve clarity of 'basis_determination' and 'matrix_analyzed' column content
mutate(basis_determination = case_when(
basis_determination == "L" ~ "lipid weight",
basis_determination == "W" ~ "wet weight",
basis_determination == "D" ~ "dry weight"
)) %>%
mutate(matrix_analyzed = case_when(
matrix_analyzed == "LI" ~ "liver",
matrix_analyzed == "MU" ~ "muscle",
matrix_analyzed == "WO" ~ "wholeorganism"
))
return(df_vars_renamed)
}
## same names and structures so can apply function to all datasets
lapply(
list("pcb_rawdata_bio", "pfos_rawdata_bio", "dioxin_rawdata_bio"),
function(x){
renameddf <- rename_vars(get(x)) %>%
rename(sub_samp_ref = tblBioID)
assign(str_remove(x, "_rawdata"), renameddf, envir = .GlobalEnv)
}
)
lapply(
list("pcb_rawdata_sed", "dioxin_rawdata_sed"),
function(x){
renameddf <- rename_vars(get(x)) %>%
rename(press_depth = DEPHU, depth_low = DEPHL) %>%
mutate(press_depth = as.numeric(press_depth), depth_low = as.numeric(depth_low))
assign(str_remove(x, "_rawdata"), renameddf, envir = .GlobalEnv)
}
)
The main objective of the code below is to clean the contaminants datases set so they contain only values based on wet weight for biota or dry weight for sediment-matrix measurements, in standardized units of ug/kg. This includes the following pre-processing steps:
B-BIO data from the other param group data (OC-CB, OC-DX, or O-FL for PCB, Dioxin, PFOS resp.) for use in lipid basis to wet weight conversion for biota or for conversion to dry weight for sediments; also for calculation of muscle equivalent in PFOS datasub_samp_ref value groupings for biota or samp_ref value groupings for sediment-matrix measurementsThe datasets each contain two main categories (param_groups) one of which is BBIO. BBIO consists of variables with information about the sample like dry/lipid/wet weight percentage, age, weight and length. The second category contains the congeners concentration information. In stages, these are each manipulated and later rejoined by the sample (samp_ref or sub_samp_ref) ID numbers.
Check Sediment data Coverage before Filtering and Standardizing Units
Setting the cutoff to 0.05m (concentration measurements from sediment samples taked from within 5cm of the surface) means we retain coverage in the Gulf of Finland for PCBs indicator, and many more measurements in Germany around Kiel. Setting the depth cutoff to any smaller value would result in poor coverage.
## basemap with baltic countries borders and BHI regions with ID numbers
bhi_rgns_simple <- rmapshaper::ms_simplify(
input = sf::st_read(
dsn = file.path(dirname(dir_B), "Shapefiles", "BHI_shapefile_corrected"),
layer = "BHI_shapefile_corrected",
quiet = TRUE
)) %>% sf::st_as_sf()
basemap <- ggplot2::ggplot() +
geom_sf(
data = rnaturalearth::ne_countries(scale = "medium", returnclass = "sf") %>%
sf::st_crop(xmin = 0, xmax = 40, ymin = 53, ymax = 67),
fill = "ivory", color = "lightsteelblue",
size = 0.1, alpha = 0.8
) +
geom_sf(data = bhi_rgns_simple, fill = NA, size = 0.15, color = "lightsteelblue") +
scale_x_continuous(limit = c(4, 32)) +
scale_y_continuous(limit = c(53.5, 66)) +
theme(panel.background = element_rect(fill = "#F8FBFC", color = "#E2EEF3"))
standardize_con_data <- function(data_renamed, matrix_biota = TRUE, contam_param, rm_vars = NULL){
contam_param <- substr(contam_param, str_length(contam_param)-1, str_length(contam_param))
## what are the specific parameters?
## unique parameter groups and associated variables:
chk_params <- data_renamed %>%
select(param_group, variable) %>%
distinct() %>%
arrange(param_group, variable)
## filter and/or remove unneeded variables from data ----
message("STEP 1: filter and/or unselect unneeded variables and congeners from dataset\n")
df <- data_renamed %>%
## remove 'suspect' and other non-'acceptable': https://vocab.ices.dk/?ref=58
filter(!vflag %in% c("S", "C")) %>%
## remove certain variables-- summarized data or depreciated codes
filter(!variable %in% rm_vars)
if(contam_param != "FL"){
## remove samples for liver tissue, keep muscle, and whole organism for length x weight
df <- filter(df, matrix_analyzed != "liver" | is.na(matrix_analyzed))
}
## for PFOS (contaminant pargroup 'O-FL'), keep liver measurements--
## all compounds except parameter 'PFOS' measured only in the liver
## DRYWT% and EXLIP% have been measured for both muscle and liver
## sediment versus biota datasets, different sample/id variables
sampvars <- c("sub_samp_id", "samp_ref", "sub_samp_ref")
if(!matrix_biota){
sampvars <- setdiff(sampvars, "sub_samp_ref")
}
## check alignment of all measurements with sub_samp_ref for biota, samp_ref for sediment
## (most unique besides measurement_ref)
## there are more unique measurement_ref but these are for different vars w/in sub_samp_id
chk_ids <- c()
for(i in c(sampvars, "measurement_ref")){
chk_ids <- c(chk_ids, sprintf("%s: %s", i, length(unique(df[[i]]))))
}
if(nrow(df %>% filter(is.na(sampvars[length(sampvars)]))) != 0){
warning(sprintf("there are observations without %s in the dataset", sampvars[length(sampvars)]))
}
message(sprintf("NOTE will align by %s when grouping to identify duplicates\n", sampvars[length(sampvars)]))
## BBIO DATA
## separate out bbio data for lipid to wet-weight conversion ----
## B-BIO variable code: http://vocab.ices.dk/?CodeID=51634
## b-bio data -- length, weight, fat and lipid content
## lipid content, etc. needed to convert lipid basis samples into the wet weight are in the B-BIO column
message("STEP 2: separate out bbio data and wrangle for use in lipid to wet-weight conversion...\n")
df_bbio <- df %>%
filter(param_group == "B-BIO") %>%
select(
station, latitude, longitude, date, year,
species, matrix_analyzed, basis_determination, qflag,
variable, unit, value,
!!!syms(sampvars)
)
## if there are cm, convert to mm
df_bbio <- df_bbio %>%
left_join(
read_csv(
here::here("supplement", "lookup_tabs", "unit_conversion.csv"),
col_types = cols()
) %>% filter(ConvertUnit == "mm"),
by = c("unit" = "OriginalUnit")
) %>%
mutate(
value = ifelse(is.na(ConvertUnit), value, value*ConvertFactor),
unit = ifelse(is.na(ConvertUnit), unit, "mm")
) %>%
select(-ConvertUnit, -ConvertFactor)
## confirm that units of bbio, are one-to-one
chk_bbio_units <- df_bbio %>%
select(variable, unit) %>%
distinct()
## if dealing with sediment, need to keep units for now...
if(matrix_biota){df_bbio <- select(df_bbio, -unit)}
chk_num_bbio_dup <- nrow(df_bbio) - nrow(distinct(df_bbio %>% select(-value)))
## BBIO DUPLICATES
## identify, explore, handle bbio duplicates ----
## group to identify duplicates; for a given sub_samp_ref only some variables may have duplicates
grpvars <- c(
"station", "year", "species", "basis_determination", "matrix_analyzed", "qflag",
sampvars, "date", "variable", "longitude", "latitude"
)
chk <- c()
for(i in 1:6){
bbio_duplicates <- df_bbio %>%
group_by(!!!syms(c(grpvars[i], grpvars[7:length(grpvars)]))) %>%
summarise(n = n()) %>%
filter(n > 1) %>%
ungroup()
chk <- c(chk, nrow(filter(bbio_duplicates, n > 1)))
}
if(var(chk) != 0){
message(sprintf(
"NOTE bbio duplicates dataframe dims vary with grpvars: %s",
paste(grpvars[which(chk != chk[which.max(tabulate(match(chk, chk)))])], collapse = ", ")
))
}
## join duplicate table by grouped vars to locate within full bbio dataset
bbio_duplicates <- df_bbio %>%
group_by(!!!syms(grpvars)) %>%
summarise(n = n()) %>%
filter(n > 1) %>%
ungroup()
df_bbio <- left_join(df_bbio, bbio_duplicates, by = grpvars)
## take averages for variable/sample duplicates
df_bbio <- bind_rows(
## averages of duplicates
df_bbio %>%
filter(n > 1) %>%
group_by(!!!syms(grpvars)) %>%
summarize(value = mean(value)) %>%
ungroup(),
## rows of bbio data with no duplicates
df_bbio %>% filter(is.na(n)) %>% select(-n)
)
## check for unique variables and matrix_analyzed, per species
chk_bbio_vars <- NULL
if(matrix_biota){
chk_bbio_vars <- do.call(
rbind,
lapply(
as.list(unique(df_bbio$species)),
function(x){
cbind(
df_bbio %>%
filter(species == x) %>%
select(matrix_analyzed, basis_determination, variable) %>%
distinct() %>%
arrange(variable, matrix_analyzed, basis_determination),
species = x
)
}
)
)
}
## WIDE BBIO DATA
## bbio data wide format ----
## remove basis_determ. as some bio vars are recorded for whole organism, others for liver or muscle,
## no concentration variables measured for whole organism matrix, so don't need for joining...
## no problem pivoting wider also indicates uniquely identified by remaining grouping variables
# unique(filter(df, param_group != "B-BIO")$matrix_analyzed)
if(contam_param == "FL"){
## for PFOS we have liver and muscle measurements
## need to deal with matrix_analyzed to get 'muscle equivalent' from conc. measurements in liver
df_bbio_wide <- df_bbio %>%
mutate(variable = paste(matrix_analyzed, variable, sep = "_")) %>%
select(-matrix_analyzed, -basis_determination)
} else {
df_bbio_wide <- df_bbio %>%
select(-basis_determination)
}
df_bbio_wide <- df_bbio_wide %>%
tidyr::pivot_wider(names_from = variable, values_from = value)
## which columns are needed to uniquely identify and later rejoin with congener conc. data
## i.e. at this stage, which subset of grpvars gives groupings such that all n are one?
# for(i in grpvars){print(paste(i, length(unique(df_bbio_wide[[i]]))))}
chkgrpvars <- intersect(grpvars, names(df_bbio_wide))
chk <- chkgrpvars
for(i in 1:length(chkgrpvars)){
bbio_wide_duplicates <- df_bbio_wide %>%
group_by(!!!syms(setdiff(chk, chkgrpvars[i]))) %>%
summarise(n = n()) %>%
ungroup() %>%
filter(n > 1)
if(nrow(bbio_wide_duplicates) == 0){
chk <- setdiff(chk, chkgrpvars[i])
}
}
message(sprintf(
"NOTE minimum group ID variables for later rejoining uniquely with congener data:\n%s\n",
paste(chk, collapse = ", ")
))
chk_bbio_wide_ids <- chk
## sub_samp_id and samp_ref together are as unique as sub_samp_ref
# nrow(select(df_bbio_wide, sub_samp_ref) %>% distinct())
# nrow(select(df_bbio_wide, sub_samp_id, samp_ref) %>% distinct())
## CONGENER CONCENTRATIONS DATA
## check congener concentration data, units and duplicates ----
if(!matrix_biota){
## units of depth variables meters
## only keep sediment measurements in top 2cm,
## or maximum 5cm if filtering to within 2cm of surface yeilds too few data pts...
df <- df %>% filter(depth_low <= 0.05)
}
df_conc <- df %>%
filter(str_detect(param_group, pattern = contam_param)) %>%
select(
station, latitude, longitude, date, year,
species, matrix_analyzed, basis_determination,
variable, unit, value, detect_lim, quant_lim, qflag,
!!!syms(sampvars)
)
## convert congener concentration units all to ug/kg
message("STEP 3: convert congener concentrations units to ug/kg")
df_conc <- df_conc %>%
left_join(
read_csv(
here::here("supplement", "lookup_tabs", "unit_conversion.csv"),
col_types = cols()
) %>% filter(ConvertUnit == "ug/kg"),
by = c("unit" = "OriginalUnit")
) %>%
mutate(
value = value*ConvertFactor,
detect_lim = detect_lim*ConvertFactor,
unit = "ug/kg"
) %>%
select(-ConvertUnit, -ConvertFactor)
## check whether variables vs units relationship is one-to-one
chk_conc_units <- df_conc %>%
select(variable, unit) %>%
distinct() %>%
arrange(variable)
df_conc <- select(df_conc, -unit)
## take averages for congener conc. variable x sample duplicates ----
## identify duplicates
## group to identify duplicates,
## for a given sub_samp_ref only some vars/congeners may have duplicates
message("STEP 4: averaging, where multiple congener concentrations observations per sub sample\n")
chk_num_conc_dup <- nrow(df_conc) - nrow(distinct(df_conc %>% select(-value, -detect_lim, -quant_lim)))
conc_duplicates <- df_conc %>%
group_by(!!!syms(grpvars)) %>%
summarise(n = n()) %>%
## no cases where duplicates with-and-without qflags, else would keep cases unflagged observations...
# group_by(!!!syms(setdiff(grpvars, "qflag"))) %>%
# mutate(allflagged = !any(is.na(qflag))) %>%
# filter(n > 1, !allflagged) %>% ...
filter(n > 1) %>%
ungroup()
## join duplicates tables by grouped vars to locate them in original dataset
df_conc <- left_join(df_conc, conc_duplicates, by = grpvars)
chk_conc_dup <- filter(df_conc, !is.na(n))
df_conc <- bind_rows(
## averages of duplicates
df_conc %>%
filter(n > 1) %>%
group_by(!!!syms(grpvars)) %>%
summarize(
value = mean(value),
detect_lim = mean(detect_lim, na.rm = TRUE),
quant_lim = mean(detect_lim, na.rm = TRUE)
) %>%
ungroup(),
## rows of conc data with no duplicates
df_conc %>% filter(is.na(n)) %>% select(-n)
)
## remaining duplicates for biota datasets are due to dual wet/lipid weight records;
## of these, keep wet weight records, when has a value
if(matrix_biota){
chk <- df_conc %>%
group_by(!!!syms(setdiff(grpvars, c("basis_determination", "matrix_analyzed")))) %>%
mutate(n = n(), ww_val = any(basis_determination == "wet weight" & !is.na(value))) %>%
filter(n > 1) %>%
mutate(
rm_duplicate =
(ww_val & basis_determination == "lipid weight")|
(!ww_val & basis_determination == "wet weight")
) %>%
ungroup()
df_conc <- df_conc %>%
left_join(chk, by = intersect(names(df_conc), names(chk))) %>%
mutate(rm_duplicate = ifelse(is.na(rm_duplicate), TRUE, rm_duplicate)) %>%
filter(rm_duplicate) %>%
select(-n, -ww_val, -rm_duplicate)
}
## do not want to spread the data because the qflag, detli etc is unique to each congener
## but, spread to check if one row for every unique sub_samp_ref for biota, samp_ref for sediment
df_conc_wide <- df_conc %>%
select(-detect_lim, -quant_lim) %>%
tidyr::pivot_wider(names_from = variable, values_from = value)
if(length(unique(df_conc[[sampvars[length(sampvars)]]])) != nrow(distinct(df_conc_wide))){
message(
"NOTE conc. subsamples not uniquely identified even after averaging;
multiple obs per ID, probably due to qflags\n"
)
}
## for biota convert lipid weight measurements to wet weight ----
## UNITS HAVE ALL BEEN CONVERTED TO MICROGRAMS PER KILOGRAM
## use the df_bbio_wide datatable to convert values in df_conc table
df_final <- left_join(df_conc, df_bbio_wide, by = intersect(grpvars, names(df_bbio_wide)))
if(matrix_biota){
message("STEP 5: converting all congener concentrations in biota to wet weight, muscle equivalence")
## for PFOS data convert liver values into those for muscle
## using conversion values from Faxneld et al 2014
## https://www.diva-portal.org/smash/get/diva2:767385/FULLTEXT01.pdf)
## follow methods in the HELCOM core indicator p.9
## http://www.helcom.fi/Core%20Indicators/PFOS_HELCOM%20core%20indicator%202016_web%20version.pdf
## use the mean liver:muscle ratio for all species (17.9), see Table 8 in Faxneld et al 2014
if(contam_param == "FL"){
df_final <- df_final %>%
mutate_at(
vars(c("value", "detect_lim", "quant_lim")),
list(pfos_muscle_equiv = ~ case_when(
variable == "PFOS" & matrix_analyzed == "liver" ~ ./17.9,
variable == "PFOS" & matrix_analyzed == "muscle" ~ .,
## NOT IN BHI1.0, BUT INCLUDING FOR INITIAL EXPLORATION...
variable == "PFOA" & species == "Clupea harengus" & matrix_analyzed == "liver" ~ ./13.4,
variable == "PFOA" & species == "Clupea harengus" & matrix_analyzed == "muscle" ~ .,
variable == "PFOA" & species == "Zoarces viviparus" & matrix_analyzed == "liver" ~ ./6.13,
variable == "PFOA" & species == "Zoarces viviparus" & matrix_analyzed == "muscle" ~ .,
variable == "PFOA" & species == "Perca fluviatilis" & matrix_analyzed == "liver" ~ ./13.7,
variable == "PFOA" & species == "Perca fluviatilis" & matrix_analyzed == "muscle" ~ .,
variable == "PFDA" & species == "Clupea harengus" & matrix_analyzed == "liver" ~ ./19.4,
variable == "PFDA" & species == "Clupea harengus" & matrix_analyzed == "muscle" ~ .,
variable == "PFDA" & species == "Zoarces viviparus" & matrix_analyzed == "liver" ~ ./5.69,
variable == "PFDA" & species == "Zoarces viviparus" & matrix_analyzed == "muscle" ~ .,
variable == "PFDA" & species == "Perca fluviatiliss" & matrix_analyzed == "liver" ~ ./5.69,
variable == "PFDA" & species == "Perca fluviatilis" & matrix_analyzed == "muscle" ~ .,
variable == "PFNA" & species == "Clupea harengus" & matrix_analyzed == "liver" ~ ./15.5,
variable == "PFNA" & species == "Clupea harengus" & matrix_analyzed == "muscle" ~ .,
variable == "PFNA" & species == "Zoarces viviparus" & matrix_analyzed == "liver" ~ ./5.62,
variable == "PFNA" & species == "Zoarces viviparus" & matrix_analyzed == "muscle" ~ .,
variable == "PFNA" & species == "Perca fluviatilis" & matrix_analyzed == "liver" ~ ./13.7,
variable == "PFNA" & species == "Perca fluviatilis" & matrix_analyzed == "muscle" ~ .,
variable == "PFPEDA" & matrix_analyzed == "liver" ~ ./7.4,
variable == "PFPEDA" & matrix_analyzed == "muscle" ~ .,
variable == "PFUnDA" & matrix_analyzed == "liver" ~ ./10.1,
variable == "PFUnDA" & matrix_analyzed == "muscle" ~ .
))
) %>%
mutate_at(
vars(c("value_pfos_muscle_equiv", "detect_lim_pfos_muscle_equiv", "quant_lim_pfos_muscle_equiv")),
## will probably all be in wet weight already, but just in case...
list(wet_wgt = ~ case_when(
basis_determination == "lipid weight" ~ .*(`muscle_EXLIP%`/100), # IS THIS HOW EXLIP% VARIABLE WORKS??
basis_determination == "dry weight" ~ .*(100/`muscle_DRYWT%`), # IS THIS HOW DRYWT% VARIABLE WORKS????
basis_determination == "wet weight" ~ .
))
)
} else {
df_final <- df_final %>%
## if value is not wet weight convert, otherwise keep value
mutate_at(
vars(c("value", "detect_lim", "quant_lim")),
list(wet_wgt = ~ case_when(
basis_determination == "lipid weight" ~ .*(`EXLIP%`/100), # IS THIS HOW EXLIP% VARIABLE WORKS????
basis_determination == "dry weight" ~ .*(100/`DRYWT%`), # IS THIS HOW DRYWT% VARIABLE WORKS????
basis_determination == "wet weight" ~ .
))
)
}
}
## for sediment data, still need to convert to dry weights
if(!matrix_biota){
## what to do with the dry weights....
message("STEP 5: converting all congener concentrations in sediment to dry weight")
df_final <- df_final %>%
## if value is not dry weight convert, otherwise keep value
## even though some have units of ug/kg treat the DRYWT% as percentage...
mutate_at(
vars(c("value", "detect_lim", "quant_lim")),
list(dry_wgt = ~ case_when(
basis_determination == "wet weight" ~ .*(`DRYWT%`/100), # IS THIS HOW DRYWT% VARIABLE WORKS????
basis_determination == "dry weight" ~ .
))
)
}
## rejoin with meta information ----
## rejoin with info on country and monitoring program info, qflaugs, etc
message("FINAL STEP: rejoining with meta information\n")
df_final <- df %>%
filter(str_detect(param_group, pattern = contam_param)) %>%
select(
country, monit_program, monit_purpose,
report_institute,
monit_year, date_ices, day, month,
num_indiv_subsample, bulk_id,
!!!syms(grpvars)
) %>%
distinct() %>%
right_join(df_final, by = grpvars) %>%
mutate(qflagged = ifelse(is.na(qflag), FALSE, TRUE))
## return wrangled data and checks ----
## list of checks as part of result
chks <- list(
chk_params, chk_ids,
chk_bbio_units, chk_num_bbio_dup, chk_bbio_vars,
chk_conc_units, chk_num_conc_dup, chk_conc_dup
)
df_final <- filter(df_final, !is.na(value))
if(!all(mutate(group_by(df_final, !!!syms(grpvars)), n = n())$n == 1)){
warning("NOTE rows of final dataframe are not uniquely identified by grouping variables")
}
return(list(
df_final = df_final,
groupingvars = grpvars,
df_conc_wide = df_conc_wide,
df_bbio_wide = df_bbio_wide,
checks = chks
))
}
Apply the Standardization function defined above to all 5 Contaminants datasets and save the cleaned, harmonized data for later analysis. For PCBs in biota data, remove PCB, SCB, SCB7 variables as these are summarized data or depreciated codes; for PCBs in sediments remove PCB and SCB.
dir_interm <- here::here("data", "CW", "contaminants", version_year, "intermediate")
lapply(
list(
## for param definitions see //vocab.ices.dk/?ref=37
## PCB datasets
## remove PCB, SCB, SCB7 - these are summarized data or depreciated codes
## OC-CB variable codes: vocab.ices.dk/?CodeID=26983
list("pcb_bio", TRUE, "OC-CB", c("PCB", "SCB", "SCB7")),
list("pcb_sed", FALSE, "OC-CB", c("PCB", "SCB")),
## Dioxin datasets
list("dioxin_bio", TRUE, "OC-DX", NULL),
list("dioxin_sed", FALSE, "OC-DX", NULL),
## PFOS dataset
list("pfos_bio", TRUE, "O-FL", NULL)
),
function(dat){
dat_final <- standardize_con_data(get(dat[[1]]), dat[[2]], dat[[3]], dat[[4]])
assign(
x = paste(dat[[1]], "cleaned", sep = "_"),
value = dat_final,
envir = .GlobalEnv
)
assign(
x = paste(dat[[1]], "clean_df", sep = "_"),
value = dat_final[["df_final"]],
envir = .GlobalEnv
)
write_csv(
dat_final[["df_final"]],
file.path(dir_interm, paste(dat[[1]], "cleaned.csv", sep = "_"))
)
}
)
## to start running code from here, will need to load in cleaned datasets...
# dir_interm <- here::here("data/CW/contaminants/v2019/intermediate")
# lapply(list.files(dir_interm), function(dat){
# dat_final <- read_csv(file.path(dir_interm, dat))
# assign(
# x = str_replace(dat, "cleaned.csv", "clean_df"),
# value = dat_final,
# envir = .GlobalEnv
# )
# })
make_plotdf <- function(x, yrs){
grpvars <- c("year", "latitude", "longitude", "variable", "num_distinct_dates", "species")
valvar <- "value_wet_wgt"
if(str_detect(x, "sed")){
grpvars <- setdiff(grpvars, "species")
valvar <- "value_dry_wgt"
}
if(str_detect(x, "pfos")){
valvar <- "value_pfos_muscle_equiv_wet_wgt"
}
cbind(
get(x) %>%
filter(!is.na(!!!syms(valvar))) %>%
filter(year %in% yrs, longitude > 9) %>%
group_by(latitude, longitude, variable) %>%
## number of disinct dates at the location for the variable
mutate(num_distinct_dates = n_distinct(date)) %>%
group_by(!!!syms(grpvars)) %>%
## mean annual values at the locatioin, by variable
summarize(
mean_val = ifelse(
str_detect(x, "sed"),
mean(value_dry_wgt),
ifelse(
str_detect(x, "pfos"),
mean(value_pfos_muscle_equiv_wet_wgt),
mean(value_wet_wgt)
)
)
) %>%
ungroup(),
source = str_to_upper(str_remove(x, "_clean_df")))
}
## apply make_plotdf bringing all datasets together
plotdf <- bind_rows(
do.call(
rbind,
lapply(
list("pcb_bio_clean_df", "dioxin_bio_clean_df", "pfos_bio_clean_df"),
function(x) make_plotdf(x, 2014:2019)
)) %>% mutate(matrix = "biota", indicator = str_remove(source, "_[A-Z].*")),
do.call(
rbind,
lapply(
list("pcb_sed_clean_df", "dioxin_sed_clean_df"),
function(x) make_plotdf(x, 2014:2019)
)) %>% mutate(matrix = "sediment", indicator = str_remove(source, "_[A-Z].*"))
)
The maps below show data distribution spatially. Size and color correspond to number of distinct dates measurements were collected for a variable, for the given location within the specified time period (2016 through 2019); larger points with more yellow color indicate greater number of observations while smaller more red points indicate fewer observations. Opacity corresponds to number of years with greater transparency indicating fewer years of data. These faceted plots are created for multiple species; the species selected for visualization are those with the greatest number of measurements recorded across the three indicators.
make_map <- function(mapdata){
## make data spatial using lat/lon columns
df <- sf::st_as_sf(
x = mapdata,
agr = "identity",
coords = c("longitude", "latitude"),
crs = 4326
)
## contaminants datasets overlaying basemap
speciesmap <- basemap +
geom_sf(
data = df,
aes(
size = num_distinct_dates,
color = num_distinct_dates),
show.legend = FALSE,
alpha = 0.5,
shape = 19
) +
scale_color_gradient2(
low = "red3",
mid = "lightcoral",
high = "lightgoldenrod",
midpoint = 2.5
) +
facet_wrap(
c("indicator", "variable"),
labeller = label_wrap_gen(width = 35, multi_line = FALSE),
ncol = 5
)
return(speciesmap)
}
# for(i in unique(plotdf$indicator)){
# species_obs <- plotdf %>%
# filter(matrix == "biota", indicator == i) %>%
# group_by(species) %>%
# summarise(n = sum(num_distinct_dates)) %>%
# arrange(desc(n))
# print(head(species_obs))
# }
make_map(filter(plotdf, matrix == "biota", str_detect(species, "Clupea harengus")))
make_map(filter(plotdf, matrix == "biota", str_detect(species, "Zoarces viviparus")))